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Abstract 

Recently there has been an increasing interest in methods that deal with multiple out- 
puts. This has been motivated partly by frameworks like multitask learning, multisensor 
networks or structured output data. From a Gaussian processes perspective, the prob- 
lem reduces to specifying an appropriate covariance function that, whilst being positive 
semi-definite, captures the dependencies between all the data points and across all the 
outputs. One approach to account for non-trivial correlations between outputs employs 
convolution processes. Under a latent function interpretation of the convolution transform 
we establish dependencies between output variables. The main drawbacks of this approach 
are the associated computational and storage demands. In this paper we address these is- 
sues. We present different sparse approximations for dependent output Gaussian processes 
constructed through the convolution formalism. We exploit the conditional independencies 
present naturally in the model. This leads to a form of the covariance similar in spirit to 
the so called PITC and FITC approximations for a single output. We show experimental 
results with synthetic and real data, in particular, we show results in pollution prediction, 
school exams score prediction and gene expression data. 

Keywords: Gaussian processes, convolution processes, sparse approximations, multitask 
learning, structured outputs, multivariate processes. 



1. Introduction 



Accounting for dependencies between model outputs has important applications in sev- 
eral areas. In sensor networks, for example, missing signals from temporal failing sensors 



may be predicted due to correlations with signals acquired from other sensors (Osborne 



et al. 2008 1 . In geostatistics, prediction of the concentration of heavy pollutant metals (for 



example, Copper), that are expensive to measure, can be done using inexpensive and over- 
sampled variables (for example, pH) as a proxy. Within the machine learning community 
this approach is sometimes known as multitask learning. The idea in multitask learning is 
that information shared between the tasks leads to improved performance in comparison to 



learning the same tasks individually (Caruana 1997). 



In this paper, we consider the problem of modeling related outputs in a Gaussian process 
(GP). A Gaussian process specifies a prior distribution over functions. When using a GP for 
multiple related outputs, our purpose is to develop a prior that expresses correlation between 
the outputs. This information is encoded in the covariance function. The class of valid 
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covariance functions is the same as the class of reproducing kernels Q Such kernel functions 
for single outputs are widely studied in machine learning (see, for example, Rasmussen and 



Williams 2006 ) . More recently the community has begun to turn its attention to covariance 



functions for multiple outputs. One of the paradigms that has been considered (Teh et al. 



2005 Osborne et al. 2008 Bonilla et al. 2008 ) is known in the geostatistics literature as the 



linear model of coregionalization (LMC) (Journel and Huijbregts 1978 Goovaerts 1997 1 . 
In the LMC, the covariance function is expressed as the sum of Kronecker products between 
coregionalization matrices and a set of underlying covariance functions. The correlations 
across the outputs are expressed in the coregionalization matrices, while the underlying 
covariance functions express the correlation between different data points. 

Multitask learning has also been approached from the perspective of regularization theory 



(Evgeniou and Pontil 2004; Evgeniou et al. 20051. These multitask kernels are obtained 



as generalizations of the regularization theory to vector-valued functions. They can also be 
seen as examples of LMCs applied to linear transformations of the input space. 

The linear model of coregionalization is a rather restrictive approach to constructing 
multiple output covariance functions. Each output can be thought of as an instantaneous 
mixing of the underlying signals/processes. An alternative approach to constructing covari- 
ance functions for multiple outputs employs convolution processes (CP). To obtain a CP in 
the single output case, the output of a given process is convolved with a smoothing kernel 
function. For example, a white noise process may be convolved with a smoothing kernel to 



obtain a covariance function (Barry and Hoef 1996; Hoef and Barry 1998). Higdon (2002) 



noted that if a single input process was convolved with different smoothing kernels to pro- 
duce different outputs, then correlation between the outputs could be expressed. This idea 



was introduced to the machine learning audience by Boyle and Frean ( 2005 ) . We can think 



of this approach to generating multiple output covariance functions as a non-instantaneous 
mixing of the base processes. 

The convolution process framework is an elegant way for constructing dependent output 
processes. However, it comes at the price of having to consider the full covariance function 
of the joint GP. For D output dimensions and N data points the covariance matrix scales as 
DN leading to 0(N s D 3 ) computational complexity and 0(N 2 D 2 ) storage. We are inter- 
ested in exploiting the richer class of covariance structures allowed by the CP framework, 
but reducing the additional computational overhead they imply. 

In this paper, we propose different sparse approximations for the full covariance matrix 
involved in the multiple output convolution process. We exploit the fact that, in the convo- 
lution framework, each of the outputs is conditional independent of all others if the input 
process is fully observed. This leads to an approximation that turns out to be strongly 



related to the partially independent training conditional (PITC) (Quihonero Candela and 



Rasmussen 2005) approximation for a single output GP. This analogy inspires us to con- 



sider a further conditional independence assumption across data points. This leads to an 
approximation which shares the form of the fully independent training conditional (FITC) 



approximation (Snelson and Ghahramani 2006 Quihonero Candela and Rasmussenl 2005). 



1. In this paper we will use kernel to refer to both reproducing kernels and smoothing kernels. Reproducing 
kernels are those used in machine learning that conform to Mercer's theorem. Smoothing kernels are 
kernel functions which are convolved with a signal to create a smoothed version of that signal. 
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This reduces computational complexity to 0(NDK 2 ) and storage to O(NDK) with K 
representing a user specified value for the number of inducing points in the approximation. 

The rest of the paper is organized as follows. First we give a more detailed review of 
related work, with a particular focus on relating multiple output work in machine learning to 
other fields. Despite the fact that there are several other approaches to multitask learning 
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2000 
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Bakker and Heskes (2003); Xue et al. 



the problem of constructing the covariance or kernel function for multiple outputs, so that 
it can be employed, for example, together with Gaussian process prediction. Then we 
review the convolution process approach in Section [3] and Section |4j We demonstrate how 
our conditional independence assumptions can be used to reduce the computational load 
of inference in Section [5] Experimental results are shown in Section [6] and finally some 
discussion and conclusions are presented in Section [7] 



2. Related Work 

In geostatistics, multiple output models are used to model the co-occurrence of minerals 
or pollutants in a spatial field. Many of the ideas for constructing covariance functions 
for multiple outputs have first appeared within the geostatistical literature, where they are 
known as linear models of coregionalization (LMC). We present the LMC and then review 
how several models proposed in the machine learning literature can be seen as special cases 
of the LMC. 



2.1 The Linear Model of Coregionalization 



The term linear model of coregionalization refers to models in which the outputs are ex- 
pressed as linear combinations of independent random functions. If the independent random 
functions are Gaussian processes then the resulting model will also be a Gaussian process 
with a positive semi-definite covariance function. Consider a set of D output functions 
{fd( x )}d=x where x G W is the input domain. In a LMC each output function, /d(x), is 



expressed as (Journel and Huijbregts 1978) 



f d (x) = ^a^qUqi-X) + fjL d . 



(1) 



q=l 



Under the GP interpretation of the LMC the functions {u q (x)}® =1 are taken (without loss 
of generality) to be drawn from a zero-mean GP with cov[u g (x), u q i(x')] = k q (x,x') if 
q = q' and zero otherwise. Some of these base processes might have the same covariance, 
i.e. kJx, x') = k q '(x,x'), but they would still be independently sampled. We can group 



together (Journel and Huijbregts, 1978; Goovaerts, 1997) the base processes that share 



latent functions, allowing us to express a given output as 



/d(x) 



Q 

EE' 

9=1 i=l 
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where the functions |w*(x)|._f , % = 1, . . . , R q , represent the latent functions that share the 
same covariance matrix k q (x, x'). There are now Q groups of functions, each member of a 
group shares the same covariance, but is sampled independently. 

In geostatistics it is common to simplify the analysis of these models by assuming that 



the processes /d(x) are stationary (Cressie, 19931. The stationarity and ergodicity condi- 



tions are introduced so that the prediction stage can be realized through an optimal linear 



predictor using a single realization of the process (Cressie, 19931. Such linear predictors 
receive the general name of cokriging. The cross covariance between any two functions /d(x) 
and /d'(x) is given in terms of the covariance functions for u q (x) 

Q Q Rq Rq 

COv[/ d (x), /d'(x')] =2222 a d, q ad>,q> CQV[?4(X), llTq, (x)]. 
5=1 q' = \ i=l i'=l 

Because of the independence of the latent functions u q (x), the above expression can be 
reduced to 

Q R q Q 

COv[/ d (x),/ d /( X ')] = 2 2 a d,q a d',q k q( X > X ') = 2 b d,d' ^( X > X ') . ( 2 ) 
q=l i=l q=l 

With b d,d> = £fil a iq a d',q- 

For a number N of input vectors, let be the vector of values from the output d 
evaluated at X = {xn}n=i- If each output has the same set of inputs the system is known 
as isotopic. In general, we can allow each output to be associated with a different set 
of inputs, = {xi^J^j, this is known as heterotopic^ For notational simplicity, we 
restrict ourselves to the isotopic case, but our analysis can also be completed for heterotopic 
set ups. The covariance matrix for is obtained expressing equation ([2| as 

Q R q Q 

COv[f d , f d /] =22 a d,q a d',q K q = 2 b d,d> K Q> 
q=l i=l 5=1 

where K. q G $t. NxN has entries given by computing fc ? (x, x') for all combinations from X. 
We now define f to be a stacked version of the outputs so that f = [f^ , . . . , f^] T . We can 
now write the covariance matrix for the joint process over f as 

Q Q 

K f , f = 2 A 9 A J ® K <? = 2 B <? ® ( 3 ) 

5=1 5=1 

where the symbol <S> denotes the Kronecker product, A q G ?R. DxRq has entries a l dq and 
Bg = AqAJ G ^R DxD has entries and is known as the coregionalization matrix. The 
covariance matrix Kf f is positive semi-definite as long as the coregionalization matrices B g 
are positive semi-definite and k q (x,x') is a valid covariance function. By definition, core- 
gionalization matrices B q fulfill the positive semi-definiteness requirement. The covariance 



2. These names come from geostatistics. 
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functions for the latent processes, fc g (x, x'), can simply be chosen from the wide variety of 
covariance functions (reproducing kernels) that are used for the single output case. Exam- 
ples include the squared exponential (sometimes called the Gaussian kernel or RBF kernel) 



and the Matern class of covariance functions (see Rasmussen and Williams 2006 chap. 4). 

The linear model of coregionalization represents the covariance function as a product 
of the contributions of two covariance functions. One of the covariance functions models 
the dependence between the functions independently of the input vector x, this is given 
by the coregionalization matrix B g , whilst the other covariance function models the input 
dependence independently of the particular set of functions /d(x), this is the covariance 
function fc ? (x, x'). 

We can understand the LMC by thinking of the functions having been generated as a 
two step process. Firstly we sample a set of independent processes from the covariance 
functions given by fc g (x,x'), taking R q independent samples for each fc g (x, x'). We now 
have R = Ylq=i Rq independently sampled functions. These functions are instantaneously 
mixea^'m a linear fashion. In other words the output functions are derived by application 
of a scaling and a rotation to an output space of dimension D. 



2.1.1 Intrinsic Coregionalization Model 

A simplified version of the LMC, known as the intrinsic coregionalization model (ICM) (see 



Goovaerts 1997 1, assumes that the elements b dd , of the coregionalization matrix B g can be 
written as b dd , = Vd^d'bq- In other words, as a scaled version of the elements b q which do 
not depend on the particular output functions /d(x). Using this form for b dd ,, equation ^ 
can be expressed as 

Q Q 

COv[/ d (x),/ d /(x')] =y)vd,d'b q kg(x.,X.') = VdJ'/sbgkgfa-X.'). 

q=l g=l 

The covariance matrix for f takes the form 

K f , f = T <g> K, (4) 

where T S $l DxD with entries v^d' an d K = Ylq=i bq^-q is an equivalent valid covariance 
function. This is also equivalent to a LMC model where we have Q = 1. As pointed out by 



Goovaerts (1997), the ICM is much more restrictive than the LMC since it assumes that 
each basic covariance k q (x, x') contributes equally to the construction of the autocovariances 
and cross covariances for the outputs. 

2.1.2 Linear Model of Coregionalization in Machine Learning 

Several of the approaches to multiple output learning in machine learning based on kernels 
can be seen as examples of the linear model of coregionalization. 



3. The term instantaneous mixing is taken from blind source separation. Of course if the underlying 
processes are not temporal but spatial, instantaneous is not being used in its original sense. However, it 
allows us to distinguish this mixing from convolutional mixing. 
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Semiparametric latent factor model. In Teh et ah] ( |2005 ), the model proposed to 
construct the covariance function for multiple outputs turns out to be a simplified version 
of equation Q. In particular, if R q = 1 (see equation Q), we can rewrite equation ^ as 



K 



f,f 



q=l 



a q a q (8) K 3 , 



where a g G $l Dxl with elements 
properties of the Kronecker productH we can write 



With some algebraic manipulations that exploit the 



K 



f,f 



[a q ® I 7V )K (? (aJ ® Ijv) = (A ® Ijv)K(A T ® Ijv), 



where I at is the iV-dimensional identity matrix, A € $t Dx ® is a matrix with columns a g 
and K £ sfcQ Nx Q N is a block diagonal matrix with blocks given by J£ q . 

The functions u q (x.) are considered to be latent factors and the model for the outputs 
was named semiparametric latent factor model (SLFM). The semiparametric name comes 
from the fact that it is combining a nonparametric model, i.e. a Gaussian process with 
a parametric linear mixing of the functions u q (x). The kernel for each basic process q, 
fc g (x, x') , is assumed to be of Gaussian type with a different length scale per input dimension. 



For computational speed up the informative vector machine (IVM) is employed (Lawrence 



et al. 2003). 



Multi-task Gaussian processes. The intrinsic coregionalization model has been em- 
ployed in Bonilla et al.|(2008) for multitask learning. The covariance matrix is expressed as 



K 



f(x),f(x') 



Kf ® fc(x, x'), with f(x) = [/i(x), . . . , /£>(x)] T , K-f being constrained positive 
semi-definite and fe(x, x') a covariance function over inputs. It can be noticed that this 
expression has is equal to the one in Q, when it is evaluated for x, x' G X. In Bonilla 



et al. 



(20081, K f (equal ;o T in equation Q) expresses the correlation between tasks or 



inter-task dependencies and it is represented through a PPCA model, with the spectral 
factorization in the PPCA model replaced by an incomplete Cholesky decomposition to 
keep numerical stability. For A;(x, x') (the function used in equation Q to compute K), the 
squared-exponential kernel is employed. To reduce computational complexity, the Nystrom 
approximation is applied. 

It can be shown that if the outputs are considered to be noise- free, prediction using the 
intrinsic coregionalization model under an isotopic data case is equivalent to independent 
prediction over each output (Helterbrand and Cressie 1994 1 . This circumstance is also 



known as autokrigeability (Wackernagel 



of inter-task transfer (Bonilla et al. 



2008 



2003) and it can also be seen as the cancellation 



Multi-output Gaussian processes. The intrinsic coregionalization model has been also 
used in Osborne et al. (2008). Matrix T in expression Q is assumed to be of the spherical 
parametrisation kind, T = diag(e)S T S diag(e), where e gives a description for the length 
scale of each output variable and S is an upper triangular matrix whose i-th column is 



4. See Brookes (20051 for a nice overview 
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associated with particular spherical coordinates of points in 3ft* (for details see Osborne and 



Roberts 2007, sec. 3.4). Function fc(x,x') is represented through a Matern kernel, where 
different parametrisations of the covariance allow the expression of periodic and non-periodic 
terms. Sparsification for this model is obtained using an IVM style approach. 

Multi-task kernels. Kernels for multiple outputs have also been studied in the context of 
regularization theory. The approach is based mainly on the definition of kernels for multitask 



learning provided in Evgeniou and Pontil ( 2004 ) and Evgeniou et al. ( 2005 1 , derived based 
on the theory of kernels for vector- valued functions. Let T> = {!,... ,D}. According to 



Evgeniou et al. (2005), the following lemma can be used to construct multitask kernels, 



Lemma 1 If G is a kernel on T x T and, for every d £ T> there are prescribed mappings 
: X — > T such that 

£^(x,x') = k((yL,d),(x',d')) = G(<D d (x),<Mx')), x,x'e5T, d,d'eV, 
then k(-) is a multitask or multioutput kernel. 



A linear multitask kernel can be obtained if we set T = 3ft m , $^(x) = C^x with &d £ !R mxp 
and G : 3ft" 1 x K m — > K as the polynomial kernel G(z, z') = (z T z') ra with n = 1, leading to 
(x,x') = x T CjCrf'x'. Lemma [l| can be seen as the result of applying kernel properties 
to the mapping ^(x) (see Genton 2001 , pag. 2). Notice that this corresponds to the linear 
model of coregionalization where each output is expressed through its own basic process 
acting over the linear transformation C^x, this is, ^(^(x)) = -u d (Cdx). 



3. Convolution processes for multiple outputs 

The approaches introduced above all involve some form of instantaneous mixing of a series 
of independent processes to construct correlated processes. Instantaneous mixing has some 
limitations. If we wanted to model two output processes in such a way that one process was 
a blurred version of the other, we cannot achieve this through instantaneous mixing. We 
can achieve blurring through convolving a base process with a smoothing kernel. If the base 
process is a Gaussian process, it turns out that the convolved process is also a Gaussian 
process. We can therefore exploit convolutions to construct covariance functions (Barry 
and Hoef, 1996 Hoef and Barry 1998; Higdon, 1998 2002). A recent review of several 



extensions of this approach for the single output case is presented in Calder and Cressie 



( 2007 ) . Applications include the construction of nonstationary covariances ( Higdon 



covariances ( |Wikle et al.| |1998[ |Wik!el 12002[ |2003). 



1998 



Higdon et al. 1998; Fuentes 2002a|b| Paciorek and Schervish 2004) and spatiotemporal 



Higdon (2002) suggested using convolutions to construct multiple output covariance 



functions. The approach was introduced to the machine learning community by Boyle and 



Frean 



(2005) 



Consider again a set of D functions {/d(x)}£ =1 . 



Now each function could be 

expressed through a convolution integral between a smoothing kernel, {G^x)}^^ and a 
latent function u(x), 



/d(x) 



A' 



Gd(x — z)n(z)dz. 



(5) 
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More generally, we can consider the influence of more than one latent function, {u q (z)}^ =1 , 
and corrupt each of the outputs of the convolutions with an independent process (which 
could also include a noise term), w d (x), to obtain 



(6) 



Q , 

y d (x) = / d (x) + w d (x) = ^2 G d,q(x ~ z)u g (z)dz + w d (x). 

q=l J* 

The covariance between two different outputs y d (x) and yd'(x') is then recovered as 

cov [y d (x),y d f(x')] =cov [/ d (x), /^(x')] + cov [wrf(x), Wd'(x')] <$d,d', 

where 6 dd > is the Kronecker delta function and 

Q Q , , 
cov [/d( x )>/d( x 0] =X) X] / G ^( x_z ) / G d',q'(x' - z')cov [tt g (z),u g /(z')] dz'dz. (7) 

Specifying G d ,q(x — z) and cov [it g (z), « g /(z')] in (Q, the covariance for the outputs /<f(x) 
can be constructed indirectly. Note that if the smoothing kernels are taken to be the Dirac 
delta function such that, 

Grf, ? (x - z) = a dj9 <5(x - z'), 

where <5(-) is the Dirac delta function^] the double integral is easily solved and the linear 
model of coregionalization is recovered. This matches to the concept of instantaneous mixing 
we introduced to describe the LMC. In a convolutional process the mixing is more general, 
for example the latent process could be smoothed for one output, but not smoothed for 
another allowing correlated output functions of different length scales. 

The traditional approach to convolution processes in statistics and signal processing is 
to assume that the latent functions u q (z) are independent white Gaussian noise processes, 
cov [u q (z),u q i(z')~\ = au q 5q,q'6(z — z'). This allows us to simplify (J7]) as 

Q r 

cov [/d(x),/#(x')] = Vcr^ / G d ,q(x - z)Gd',g(x' - z)dz. 

In general though, we can consider any type of latent process, for example, we could 
assume independent GPs for the latent functions so that we have cov [u q (z),u q i(z')~\ = 
ku q ,u q ,(z,z')S q>q i. With this form of the latent functions, ([7]) can be written as 

Q f f 

cov [/rf(x),/rf/(x / )j = /; G d ,q(x.-z) Gd', q (-x.' -z')k Uq)Uq (z,z')dz'dz. (8) 

q=l ^ X J X 

As well as this correlation across outputs, the correlation between the latent function, u q (z), 
and any given output, /d(x), can be computed, 

cov[fd(x),u q (z)))= G d , q {x- z')k Uq)Uq {z\z)dz . (9) 
Jx 



5. We have slightly abused of the delta notation to indicate the Kronecker delta for discrete arguments and 
the Dirac function for continuous arguments. The particular meaning should be understood from the 
context. 
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Higdon (2002| proposed the direct use of convolution processes for constructing multiple 



output Gaussian processes. Lawrence et al. (20071 arrive at a similar construction from 



solving a physical model: a first order differential equation (see also Gao et al. 20081. This 



id ea of using phys i cal m odels to inspire multiple output systems has been further extended 



m 



Alvarez et al. (2009) who give examples using the heat equation and a second order 



system. A different approach using Kalman Filtering ideas has been proposed in Calder 



(2003 2007). Calder proposed a model that incorporates dynamical systems ideas to the 



process convolution formalism. Essentially, the latent processes are of two types: random 
walks and independent cyclic second-order autoregressions. With this formulation, it is 
possible to construct a multivariate output process using convolutions over these latent 
processes. Particular relationships between outputs and latent processes are specified using 
a special transformation matrix ensuring that the outputs are invariant under invertible 
linear transformations of the underlying factor processes (this matrix is similar in spirit to 



the sentitivity matrix of Lawrence et al. (2007) and it is given a particular form so that not 



all latent processes affect the whole set of outputs (Calder, 2007)) 



Bayesian kernel methods. The convolution process is closely related to the Bayesian 



kernel method (Pillai et al. 2007; Liang et al. 2009) for constructing reproducible kernel 



Hilbert spaces (RKHS), assigning priors to signed measures and mapping these measures 
through integral operators. In particular, define the following space of functions, 



? = {/|/0*0 = J G(x, zh(dz), 7 g r|, 



for some space T C B(X) of signed Borel measures. In Pillai et al. (2007 proposition 1), 
the authors show that for V = B(X), the space of all signed Borel measures, T corresponds 
to a RKHS. Examples of these measures that appear in the form of stochastic processes 
include Gaussian processes, Dirichlet processes and Levy processes. This framework can be 
extended for the multiple output case, expressing the outputs as 



fd{x) 



G d (x,z)-y(dz). 



x 



The analysis of the mathematical properties of such spaces of functions are beyond the 
scope of this paper and are postponed for future work. 



4. Constructing Multiple Output Convolution Processes 

We now consider practical examples of how these multiple output convolution processes can 
be constructed. W e start with a more g eneric example (although it has an underlying phys- 
ical interpretation (Alvarez et al. 2009 )), which can be seen as the equivalent of the squared 
exponential covariance function for multiple outputs. We will then consider a particular 
physical model: a simple first order differential equation for modeling transcription. 

Example 1. A general purpose convolution kernel for multiple outputs. A 

simple general purpose kernel for multiple outputs based on the convolution integral can 
be constructed assuming that the kernel smoothing function, Grf ir (x), and the covariance 
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for the latent function, k UqjU (x, x'), follow both a Gaussian form. The kernel smoothing 
function is given as 

G^(x) = S d , q M(x\0,p- 1 ), 

where Sd „ is a variance coefficient that depends both of the output d and the latent function 
q and is the precision matrix associated to the particular output d. The covariance 
function for the latent process is expressed as 

fc Ugj% (x,x / )=AT(x-x / |0 J A- 1 ), 

with A q the precision matrix of the latent function q. 

Expressions for the kernels are obtained applying systematically the identity for the 
product of two Gaussian distributions. Let AA(x|/x, P^ 1 ) denotes a Gaussian for x, then 

A/^xl/^P^MxI/^Ps 1 ) =AA(/x 1 | / ^ 2 ,Pr 1 +P 2 1 )AA(x| / i c ,p- 1 ), (10) 

where /x c = (P x + P 2 ) _1 (Pi/xi + P 2 /x 2 ) and P" 1 = (Pi + P 2 )~ 1 . For all integrals we 
assume that X = W. Using these forms for Gd i<? (x) and fc n<3jUg (x, x'), expression Q can be 
written as 

fc /di/d ,(x,x') = y> d , 9 Stf,, / A^x-z^p- 1 ) / jV(x'-z / |0,P/)JV*(z-z'|0 ) A- 1 )dz / dz. 
p[ Jx Jx 

Since the Gaussian covariance is stationary and isotropic, we can write it as AA(x— x'|0, P _1 ) = 



jV(x'— x|0, P 1 ) = AA(x|x', P 1 ) = jV(x'|x, P r ). Using the identity in equation (fTob twice, 
we get 

Q 

kf d ,f d , (x, x') = ^ S^S^i* - x'|0, P- 1 + P,, 1 + A- 1 ). 

9=1 

For a high value of the input dimension, p, the term l/[(27r) p / 2 |P ( ^ 1 + P ( ^ 1 +A~ 1 | 1 / 2 ] in each 
of the Gaussian's normalization terms will dominate, making values go quickly to zero. We 
can fix this problem, by scaling the outputs using the factors l/[(2 7 r)f/ 4 |2Pj 1 + A^ 1 | 1 / 4 ] 
and l/[(27r) p/ ' 4 |2P^, 1 + A" 1 ) 1 / 4 ]. Each of these scaling factors correspond to the standard 
deviation associated to kf d j d (x,x) and kf ,,/., (x, x). 

Equally for the covariance cov [/d(x), it g (x'))] in equation Q, we obtain 

fc /d , U9 (x,x') = Sd, q N(x - x'|0, P^ 1 + A" 1 ). 

Again, this covariance must be standardized when working in higher dimensions. 

Example 2. Convolution kernels constructed through a first order differential 
equation. The convolution integral appears naturally when solving ordinary differential 
equations. In this case, the smoothing kernel function corresponds to the impulse response 
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of the system described by the differential equation. As an example consider the following 
set of first order differential equations where the input variable is time, t, 



dm 

dt 



^2s dtq u q (t) -jdfd{t), d=l,...,D, 

q=l 



where jd is a parameter of the particular system (electrical circuit, mechanical system, 
among others) and Sd q quantifies the influence of latent function q over output d. Assuming 
initial conditions equal to zero, the solution to the above equation is given as 



fd(t) = V S d , q / exp(-7 d (t - r))ng(r)dr. 



(11) 



If the functions {u q (t)}® =1 are Gaussian processes with squared exponential kernel 

/ /"\ ( {t-t') 2 ' 

K q ,u q {t, t ) = exp ^ ^ — 

where i q represents the length-scale parameter, the covariance for the outputs can be found 
( |Lawrence et al.[ |2007| |Gao et ah] |2008[ ) as 

Q j-t rt> 

9 =i J o Jo 

Solving the above equation, the covariance kf d j ,(t,t') is given as 

exp(-7 d t - 7d'tf)[h q (rfd',7d,t,t?) + h q (7 d , 7d',t',t)], 



g=l 



Sd,qS d ', q y/7rlq 



where 



h q (7d,7d',t,t') 



exp 



2 



7d + 7d> 



exp[(7 rf + 7d')t]H q (7d, t, t') - TC q (7d, 0, t') 



and 



H q (7d,t,t') = erf - + 



t , 7dtq 



tq ' 2 



In the above expression, the function erf (x) is the so called error function and it is defined 
as erf (z) = J*^ exp(— £ 2 )d£. The covariance between f d (t) and u q (t') is given as 



kf d ,u q {t, t ) 



l\ _ S d ,q\/7r£q 



exp 



(l^y}exp[-7 d (t-t')]H q (7d,t',t). 
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5. Sparse Approximations for Convolutional Processes 



Assuming that the double integral is tractable, the principle challenge for the convolu- 
tional framework is computing the inverse of the covariance matrix associated with the out- 
puts. For D outputs, each having N data points, the inverse has computational complexity 
0(D 3 iV 3 ) and associated storage of 0(D 2 N 2 ). We show how throug h making specific condi 



tional independence assumptions, inspired by the model structure (Alvarez and Lawrence 



20091, we arrive at a sparse approximation similar in form to the partially independent 
training conditional model (PITC, see Quihonero Candela and Rasmussen 20051. The 
relationship with PITC inspires us to make further conditional independence assumptions. 



5.1 Full Dependence 

Given the convolution formalism, we can construct a full GP over the set of outputs. The 
likelihood of the model is given by 

p(y|X,0)=jV(O,K flf + E) ) (12) 

where y = [yj , . . . ,yJ] T is the set of output functions with y d = [yd(xi), . . . ,y d (-x N )} T ; 
Kf f € ?ft DNxDN is the covariance matrix arising from the convolution. It expresses the 
covariance of each data point at every other output and data point and its elements are 
given by cov [/^(x), /^(x')] in ([8]). The term E represents the covariance associated with 
the independent processes in (pL to^x). It could contain structure, or alternatively could 
simply represent noise that is independent across the data points. The vector 6 refers to the 
hyperparameters of the model. For exposition we will focus on the isotopic case (although 
our implementations allow heterotopic modeling), so we have a matrix X = {xi, . . . ,xjy} 
which is the common set of training input vectors at which the covariance is evaluated. 



The predictive distribution for a new set of input vectors X* is (Rasmussen and Williams 



2006) 



p(y*|y, X,X*,0) =N (K f „ f (K f , f + E^y, K Ut£t - K f „ f (K f)f + E)" 1 !^. + E) , 

where we have used Kf as a compact notation to indicate when the covariance ma- 
trix is evaluated at the inputs X*, with a similar notation for Kf; f . Learning from the 
log-likelihood involves the computation of the inverse of Kf f + E giving the problematic 
complexity of 0(iV 3 L> 3 ). Once the parameters have been learned, prediction is O(ND) for 
the predictive mean and 0(N 2 D 2 ) for the predictive variance. 



5.2 Latent functions as conditional means 

We restrict the analysis of the approximations to one latent function u(x). The key to 
all approximations is based on the form we assume for the latent functions. From the 
perspective of a generative model, equation ^ can be interpreted as follows: first we draw 
a sample from the Gaussian process prior p(u(z)) and then solve the integral for each of the 
outputs /d(x) involved. Uncertainty about u(z) is also propagated through the convolution 
transform. 

For the set of approximations, instead of drawing a sample from u(z), we first draw a 
sample from a finite representation of u(z), u(Z) = [u(zi), . . . , u{zk)V j where Z = {zk\k=i 
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is the set of input vectors at which u(z) is evaluated. Due to the properties of a Gaussian 
process, p(u(Z)) follows a multivariate Gaussian distribution. Conditioning on u(Z), we 
next sample from the conditional prior p(u(z)|u(Z)) and use this function to solve the 
convolution integral for each /^(x)J^] Under this generative approach, we can approximate 
each function /d(x) using 

/ d (x)« f G d (x-z)E[ U (z)|u]dz. (13) 
Jx 

Replacing u(z) for E[n(z)|u] is a reasonable approximation as long as u{z) be a smooth 
function so that the infinite dimensional object u(z) can be summarized by u. Figure [T] 
shows a cartoon example of the quality of the approximations for two outputs as the size 
of the set Z increases. The first column represents the conditional prior p(u(z)\u) for a 
particular choice of u{z). The second and third columns represent the outputs /i(x) and 



/2(x) obtained when using equation (13). 



Using expression (13), the likelihood function for f follows 



p(f |u, Z, X, 6) = N ( K f>u K- U, K f)f - Kf uK-^K 



f,u 



where K UjU is the covariance matrix between the samples from the latent function u(Z), 
with elements given by k U:U (z,z') and Kf u = K^ f is the cross-covariance matrix between 
the latent function u(z) and the outputs /d(x), with elements cov [/d(x), u(z)] in ([9]). Given 
the set of points u, we can have different assumptions about the uncertainty of the outputs 
in the likelihood term. For example, we could assume that the outputs are independent 
or uncorrelated, keeping only the uncertainty involved for each output in the likelihood 
term. Other approximation would assume that the outputs are deterministic, this is Kf f = 
Kf >u K u { 1 K i T u . The only uncertainty left would be due to the prior p(u). Next, we present 
different approximations of the covariance of the likelihood that lead to a reduction in 
computational complexity. 



5.2.1 Partial Independence 

We assume that the set of outputs f are independent given the latent function u, leading 
to the following expression for the likelihood 

D D 

p(f|u,Z,X,0) = ]Jp(f d \u,Z,X,e) = \\N (Kf djU K u ^u, Kf di f d — Kf djU K~ ^K u f d ) . 

d=l d=l 

We rewrite this product of multivariate Gaussians as a single Gaussian with a block diagonal 
covariance matrix, including the uncertainty about the independent processes 

p(y|u, Z, X, 6) = M (Kf.uK->, D + S) (14) 

where D = blockdiag [Kf f — Kf u K~^,K U) f] , and we have used the notation blockdiag [G] 
to indicate that the block associated with each output of the matrix G should be re- 
tained, but all other elements should be set to zero. We can also write this as D = 

6. For simplicity in the notation, we just write u to refer to u(Z). 
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(a) Conditional prior for K = 5 



(b) Output one for K = 5 



-0.2 0.2 0.4 0.6 



(c) Output two for K = 5 




-0.8 -0.6 -0.4 -0.2 0.2 0.4 0.6 



(d) Conditional prior for K = 10 




■0.2 0.2 0.4 0.6 0.8 



(e) Output one for K = 10 




-0.2 0.2 0.4 0.6 



(f) Output two for K = 10 




0.2 0.2 0.4 0.6 0.8 



(g) Conditional prior for K — 30 



(h) Output one for K = 30 




0.2 0.2 0.4 0.6 



(i) Output two for K — 30 



Figure 1: Conditional prior and two ouputs for different values of K. The first column, figures 
l(a)[ 1(d) ano - l(g)| shows the mean and confidence intervals of the conditional prior distribution 
using one input function and two output functions. The dashed line represents one sample from the 
prior. Conditioning over a few points of this sample, shown as black dots, the conditional mean and 
conditional covariance are computed. The solid line represents the conditional mean and the shaded 
region corresponds to 2 standard deviations away from the mean. The second column, 1(b) | 1(e) 
and l(h)| shows the solution to equation (|5| for output one using the sample from the prior (dashed 
line) and the conditional mean (solid line), for different values of K. The third column, 1 (c) | 1(f) 
and l(i) shows the solution to equation ^ for output two, again for different values of K. 



[Kf 5 f — Kf u K u uK U) f] M where is the Hadamard product and M = I,d In, Ijv 
being the N x N matrix of ones. We now marginalize the values of the samples from the 
latent function by using its process prior, this means p(u|Z) = AA(0,K UiU ). This leads to 
the following marginal likelihood, 

p(y|Z, X, 6) = [ p(y|u, Z, X, 0)p(u|Z)du = M (0, D + Kf u K u „K U f + S) . (15) 
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Notice that, compared to ( 12 ), the full covariance matrix Kf f has been replaced by the low 



rank covariance Kf U K U uK u ,f in all entries except in the diagonal blocks corresponding 
to Kf dj f d , . Depending on our choice of K the inverse of the low rank approximation to 
the covariance is either dominated by a 0(DN 3 ) term or a 0(K 2 DN) term. Storage of 
the matrix is 0(N 2 D) + O(NDK). Note that if we set K = N these reduce to 0(N 3 D) 
and 0(N 2 D) respectively Rather neatly this matches the computational complexity of 
modeling the data with D independent Gaussian processes across the outputs. 



The functional form of (15) is almost identical to that of the PITC approximation 



(Quihonero Candela and Rasmussen 2005), with the samples we retain from the latent 



function providing the same role as the inducing values in the partially independent training 
conditional (PITC) approximation. This is perhaps not surprising given that the PITC 
approximation is also derived by making conditional independence assumptions. A key 
difference is that in PITC it is not obvious which variables should be grouped together when 
making these conditional independence assumptions, here it is clear from the structure of 
the model that each of the outputs should be grouped separately. However, the similarities 



are such that we find it convenient to follow the terminology of Quihonero Candela and 



Rasmussen (2005) and also refer to our approximation as a PITC approximation. 



5.2.2 Full Independence 

We can be inspired by the analogy of our approach to the PITC approximation and consider 
a more radical factorization of the likelihood term. In the fully independent training condi- 



tional (FITC) (Snelson and Ghahramani, 2006, 2007) a factorization across the data points 



is assumed. For us that would lead to the following expression for conditional distribution 
of the output functions given the inducing variables, 

D N 

p(f |u, Z, X, 6) = Yl Y[ p(/ M |u, Z, X, 9), 



d=ln=l 



Kf.uK-^f 



which can be briefly expressed through (14) with 
[K fjf - K f u K u ^K u f ] M, with M = I D ® I N . 
uncertainty about the independent processes is given by equation 
form for D, leading to the fully independent training conditional (FITC) approximation 
(Snelson and Ghahramani 2006; Quihonero Candela and Rasmussen 2005| ). 



D = diag [Kf .f 
The marginal likelihood, including the 
15|) with the diagonal 



5.2.3 Deterministic likelihood 



In Quihonero Candela and Rasmussen (2005) the relationship between the projected process 



approximation and the FITC and PITC approximations is elucidated. They show that if, 



given the set of values u, the ouputs are deterministic, the likelihood term of equation (13) 
can be simplified as 

p(f |u, Z, X, 6) = AT (Kf^K-^u, 0) . 

Marginalizing with respect to the latent function using p(u|Z) = M(0, K U;U ) and including 
the uncertainty about the independent processes, we obtain the marginal likelihood as 

p(y|Z, X, &) = J p(y|u, Z, X, 0)p(u|Z)du = M (o, Kf^K^K^ + s) . 
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In other words, we approximate the full covariance Kf f using the low rank approximation 
'-iK T 



Kf u K u ^KJ . Using this new marginal likelihood to estimate the parameters 6 reduces 



computational complexity to 0(K 2 DN). The approximation obtained has similarities with 
the projected latent variables (PLV) method also known as the projected process approxi- 



mation (PPA) or the deterministic training conditional (DTC) approximation (Csato and 



Upperj [200T j |S eeger et al.| |2003| IQuihonero Candela and Rasmussen[ |2005| |Rasmussen and 



Williams 2006 ) . For this reason we refer to this approximation as the deterministic training 



conditional approximation (DTC) for multiple output Gaussian processes. 
5.2.4 Additional independence assumptions. 

As mentioned before, we can consider different conditional independence assumptions for 
the likelihood term. One further assumption that is worth mentioning considers conditional 
independencies across data points and dependence across outputs. This would lead to the 
following likelihood term 

TV 

p(f|u,Z,X,0) = I]p(f n |u,Z,X,0), 

n=l 



where f ra = [/i(x n ), /2(x n ), . . . , /o(x n )] . We can use again equation (14) to express the 
likelihood. In this case, if the matrix D is a partitioned matrix with blocks G ?R. NxN , 

each block T)d,d' would be given as T)d,d' = diag [Kf d) f , — Kf djU K u „K Uj f d ,] . For cases in 
which D > N, that is, the number of outputs is greater than the number of data points, 
this approximation may be more accurate than PITC. For cases where D < N it may be 
less accurate than PITC, but faster to compute Q 

5.3 Posterior and predictive distributions 

Combining the likelihood term for each approximation with p(u|Z) using Bayes' theorem, 
the posterior distribution over u is obtained as 

p(u|y, X, Z, 6) = M (K UiU A _1 K Uj f (D + E^y, K^A^K^) (16) 

where A = K uu + K Uj f(D + E) _1 Kf u and D follows a particular form according to the 
different approximations: for PITC it equals D = blockdiag [Kf f f — Kf^K^ ^K Uj f] , for 
FITC D = diag [K f)f - Kf )U K~^K U) f] and for DTC D = 0. The predictive distribution is 



expressed through the integration of the likelihood term, evaluated at X*, with (16), giving 
p(y*|y,X,X*,Z,0) = J p( y ,|u,Z,X*,0)p(u|y,X,Z,0)du 

=M (Kf i(iiU A _1 K U) f (D + E)-V, D» + Kf.^A" 1 ^. + E) , 

with D, = blockdiag [K f . if , - Kf^K^Ku,^ for PITC, D, = diag [K UtU - Kf.^K^K^f, 
for FITC and = for DTC. 



7. Notice that if we work with the block diagonal matrices T>d,d' > we would need to invert the full matrix D. 
However, since the blocks T>d,d' are diagonal matrices themselves, the inversion can be done efficiently 
using, for example, a block Cholesky decomposition. Furthermore, we would be restricted to work with 
isotopic input spaces. Alternatively, we could rearrange the elements of the matrix D so that the blocks 
of the main diagonal are the covariances associated with the vectors i„. 
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5.4 Fitting the Model 



The marginal likelihood approximation for the PITC, FITC and DTC variants for sparse 
multioutput Gaussian processes is a function of both the parameters of the covariance 
function and the location of the inputs for the inducing variables. One of the key ideas 



presented in Snelson and Ghahramani (2006) was that we should optimize with respect to 



the location of these inducing variables. Previously, the inducing variables were taken to be 



a subset of the data variables (Csato and Opper 2001 Williams and Seeger 2001 1 . However, 



a method of choosing which subset of the data is required (Smola and Bartlett 2001 Seeger 



et al. 20031. Such criteria can be expensive to compute and also lead to fluctuations in 



the approximation to the log-likelihood when the subset changes. This causes problems 
as while the parameters of the Gaussian process are optimized, the optimal subset of the 
inducing inputs will also change. Convergence is therefore difficult to monitor. The key 
advantage of optimizing the inducing input locations, Z, across the entire input domain is 
that convergence of the likelihood will be smooth. In appendix |A.l we include the derivatives 
of the marginal likelihood wrt the matrices Kf f , K u f and K u u . 



6. Experimental evaluation 

In this section we present results of applying the sparse methods in pollutant metal pre- 
diction, exam score prediction and the prediction of transcription factor behavior in a 
gene-network. First, though, we ilustrate the performance of the sparse method in a toy 
example j^] 



6.1 A toy example 

For the toy experiment, we employ the kernels constructed in Example 1 of section [4] 
The toy problem consists of D = 4 outputs, one latent function, Q = 1 and one input 
dimension. The training data was sampled from the full GP with the following parameters, 
S hl = S 2 ,i = 1, 5*3,1 = 5 4) i = 5, Pi,! = P 2 ,i = 50, P 3 ,i = 300,P 4 ,i = 200 for the 
outputs and Ai = 100 for the latent function. For the independent processes, u^(x), we 
simply added white noise separately to each output so we have variances a\ = a\ = 0.0125, 
cr| = 1.2 and erf = 1. We generate N = 500 observation points for each output and use 
200 observation points (per output) for training the full and the sparse multiple output GP 
and the remaining 300 observation points for testing. We repeated the same experiment 
setup 10 times and compute the standardized mean square error (SMSE) and the mean 
standardized log loss (MSLL) as defined in Rasmussen and Williams] ( |2006 ) . For the sparse 
methods we use K = 30 inducing inputs. We sought the kernel parameters and the positions 
of the inducing inputs through maximizing the marginal likelihood using a scaled conjugate 
gradient algorithm. Initially the inducing inputs are equally spaced between the interval 

[-1,1]- 

Figure [2] shows the training result of one of the ten repetitions. The predictions shown 
correspond to the full GP (Figure 2(a)), the DTC approximation (Figure [2(b) ), the FITC 



approximation (Figure 2(c)) and the PITC approximation (Figure 2(d)). 



Code to run all simulations in this section is available at http://www.cs.manchester.ac.uk/-neill/ 
multigp/ 
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(a) 2/4(0;) using the full GP 



(b) 2/4(0;) using the DTC approximation 




(c) 2/4(2;) using the FITC approximation 



(d) 2/4 (x) using the PITC approximation 



Figure 2: Predictive mean and variance using the full multi-output GP and the sparse approxima- 
tions for output 4. The solid line corresponds to the predictive mean, the shaded region corresponds 
to 2 standard deviations of the prediction. The dashed line corresponds to the ground truth signal, 
that is, the sample from the full GP model without noise. In these plots the predictive mean overlaps 
almost exactly with the ground truth. The dots are the noisy training points. The crosses in figures 



2(b) 2(c) and 2(d) correspond to the locations of the inducing inputs after convergence. 



Tables [T] and [2] show the average prediction results over the test set. Table [TJ shows 
that the SMSE of the sparse approximations is similar to the one obtained with the full 
GP. However, there are important differences in the values of the MSLL shown in table |2j 
DTC offers the worst performance. It gets better for FITC and PITC since they offer a 
more precise approximation to the full covariance. 

Also, the training times for iteration of each model are 1.97 ± 0.02 sees for the full GP, 
0.20 ± 0.01 sees for DTC, 0.41 ± 0.03 for FITC and 0.59 ± 0.05 for the PITC. 

As we have mentioned before, one important feature of multiple output prediction is 
that we can exploit correlations between outputs to predict missing observations. We used 
a simple example to illustrate this point. We removed a portion of one output between 
[—0.8,0] from the training data in the experiment before (as shown in Figure [3]) and train 
the different models to predict the behavior of y^x) for the missing information. The 
predictions shown correspond to the full GP (Figure |3(a) ), an independent GP (Figure [3 (b)[ ), 



the DTC approximation (Figure 3(c)[ ), the FITC approximation (Figure [3(d) ) and the PITC 
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Method 


SMSE yi (x) 


SMSE y 2 (x) 


SMSE y 3 (x) 


SMSE y A {x) 


Full GP 
DTC 
FITC 
PITC 


1.06 ±0.08 
1.06 ±0.08 
1.06 ±0.08 
1.06 ±0.08 


0.99 ±0.06 
0.99 ±0.06 
0.99 ±0.06 
0.99 ±0.06 


1.10 ±0.09 
1.12 ±0.09 
1.10 ±0.08 
1.10 ±0.09 


1.05 ±0.09 
1.05 ±0.09 
1.05 ±0.08 
1.05 ±0.09 



Table 1: Standarized mean square error (SMSE) for the toy problem over the test set. All numbers 
are to be multiplied by 1CP 2 . The experiment was repeated ten times. Table includes the value of 
one standard deviation over the ten repetitions. 



Method 


MSLL yi{x) 


MSLL y 2 (x) 


MSLL y 3 (x) 


MSLL y 4 (x) 


Full GP 
DTC 
FITC 
PITC 


-2.27 ±0.04 
-0.98 ±0.18 
-2.26 ±0.04 
-2.27 ±0.04 


-2.30 ±0.03 
-0.98 ±0.18 
-2.29 ±0.03 
-2.30 ±0.03 


-2.25 ±0.04 
-1.25 ±0.16 
-2.16 ±0.04 
-2.23 ±0.04 


-2.27 ±0.05 
-1.25 ±0.16 
-2.23 ±0.05 
-2.26 ±0.05 



Table 2: Mean standardized log loss (MSLL) for the toy problem over the test set. More negative 
values of MSLL indicate better models. The experiment was repeated ten times. Table includes the 
value of one standard deviation over the ten repetitions. 



approximation (Figure 3(e)| ) . The training of the sparse methods is done in the same way 



than in the experiment before. 

Due to the strong dependencies between the signals, our model is able to capture the 
correlations and predicts accurately the missing information. 

6.2 Heavy Metals in the Swiss Jura 

The first example with real data that we consider is the prediction of the concentration of 
several metal pollulants in a region of the Swiss Jura. The data consist of measurements 
of concentrations of several heavy metals collected in the topsoil of a 14.5 km 2 region of 
the Swiss Jura. The data is divided into a prediction set (259 locations) and a validation 
set (100 locations) In a typical situation, referred to as undersampled or heterotopic 
case, a few expensive measurements of the attribute of interest are supplemented by more 
abundant data on correlated attributes that are cheaper to sample. We follow the exper- 



iment described in Goovaerts (1997 p. 248, 249) in which a primary variable (cadmium) 
at prediction locations in conjunction with some secondary variables (nickel and zinc) at 
prediction and validation locations, are employed to predict the concentration of the pri- 
mary variable at validation locations. We compare results of independent GP, the different 



approximations described before, the full GP and ordinary cokriging. u For the convolved 



9. This data is available at 
10. Cokriging is the generalization of kriging to multiple outputs. Within cokriging there are several alter- 



http : //www. ai-geostats . org/ 



natives, including simple and ordinary cokriging. Interested readers are referred to (Goovaerts 1997 
ch. 6) for details. In the geostatistics literature, the usual procedure is to use the linear model of core- 
gionalization to construct a valid covariance function and then use the cokriging estimator for making 
predictions. 
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(a) 2/4 (a;) using the full GP 



(b) 2/4(2:) using an independent GP 




(c) 2/4 (x) using the DTC approximation 




X X XX< X X X XX X XX X V. -10 XXX XX X xxxxxxxxxxxxxxx> 



(d) 2/4 (x) using the FITC approximation 




-10X X X XXX XXX XXX XX X X xxxxx X X X X X 

-1 -0.5 03 1 

X 

(e) 2/4 (a;) using the PITC approximation 



Figure 3: Predictive mean and variance using the full multi-output GP, the sparse approximations 
and an independent GP for output 4 with a range of missing observations in the interval [—0.8, 0.0]. 
The solid line corresponds to the mean predictive, the shaded region corresponds to 2 standard 
deviations away from the mean and the dash line is the actual value of the signal without noise. 
The dots are the noisy training points. The crosses in figures 3(c)| |3(d)| and 3(e) correspond to the 
locations of the inducing inputs after convergence. 



GPs, we use one {Q = 1) latent function. For the sparse approximations results, a k-means 
procedure is employed first to find the initial locations of the inducing values and then 
these locations are optimized in the same optimization procedure used for the parameters. 
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Each experiment is repeated ten times. The result for ordinary cokriging was obtained from 



Goovaerts (p. 248, 249 1997 1 . In this case, no values for standard deviation are reported. 
Figure [4] shows the results of prediction for cadmium (Cd). It can be noticed that as more 
inducing values are included, the approximations follow the performance of the full GP, as 
would be expected. For this particular dataset, FITC and PITC exhibit lower variances 
compared to DTC. In terms of the performance, it can be seen that PITC outperforms 
FITC and DTC when compared in terms of the number of inducing points. FITC and 
PITC also outperform the independent GP method, and for 200 and 500 inducing points, 
they outperform the cokriging method. 
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Figure 4: Mean absolute error and standard deviation for prediction of the pollutant metal Cadmium. 
The experiment was repeated ten times. In the bottom of the figure T)K, FK, PK stands for DTC, 
FITC and PITC with K inducing values, respectively, FGP stands for full GP, CK stands for 



ordinary cokriging using the linear model of coregionalization (see Goovaerts (1997 1 for detailed 



description of the ordinary cokriging estimator) and IND stands for independent GP. 



6.3 Exam score prediction 

In the second experiment with real data the goal is to predict the exam score obtained by 
a particular student belonging to a particular school. The data comes from the Inner Lon- 
don Education Authority (ILEA) J^j It consists of examination records from 139 secondary 
schools in years 1985, 1986 and 1987. It is a random 50% sample with 15362 students. The 
input space consists of four features related to each student (year in which each student 
took the exam, gender, VR band and ethnic group) and four features related to each school 



11. This data is available at http : //www . cmm.br istol . ac . uk/learning- training/mult ilevel-m- support/ 
datasets . shtml 
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(percentage of students eligible for free school meals, percentage of students in VR band 
one, school gender and school denomination). From the multiple output point of view, each 
school represents one output and the exam score of each student a particular instantiation 
of that output or D = 139. 



We follow the same preprocessing steps employed in Bonilla et al. (20081. The only 
features used are the student-dependent ones, which are categorial variables. Each of them 
is transformed to a binary representation. For example, the possible values that the variable 
year of the exam can take are 1985, 1986 or 1987 and are represented as 100, 010 or 001. 
The transformation is also applied to the variables gender (two binary variables), VR band 
(four binary variables) and ethnic group (eleven binary variables), ending up with an input 
space with 20 dimensions. The categorial nature of the data restricts the input space to 
N = 202 unique input feature vectors. However, two students represented by the same 
input vector x, and belonging both to the same school, d, can obtain different exam scores. 
To reduce this noise in the data, we take the mean of the observations that, within a school, 
share the same input vector and use a simple heteroskedastic noise model in which the 
variance for each of these means is divided by the number of observations used to compute 
it. 12 For the convolved GPs, we use one (Q = 1) latent function. The performace measure 
employed is the percentage of explained variance defined as the total variance of the data 
minus the sum-squared error on the test set as a percentage of the total data variance. It 
can be seen as the percentage version of the coefficient of determination between the test 
targets and the predictions. The performace measure is computed for ten repetitions with 
75% of the data in the training set and 25% of the data in the testing set. 

As in the Jura dataset experiment, the initial positions of the inducing points are selected 
using the k-means algorithm with the data points as inputs to the algorithm. The positions 
of these points are optimized in a scaled conjugate gradient procedure together with the 
parameters of the model. Figure [5] shows results of the sparse methods, the ICM model and 



independent GPs. The results for the ICM model are the best results presented in Bonilla 



et al. (2008|. The independent GPs result was also obtained from Bonilla et al. (2008 1 . 
It can be seen that the sparse convolved multiple output GP framework outperforms the 
ICM model and the independent GPs, even with as few as 5 inducing points. FITC and 
PITC slightly outperform the DTC method, which also has greater variances. This dataset 
was also employed to evaluate the performance of the multitask kernels in |Evgeniou and 



Pontil (2004 1 . The best result presented in this work was 34.37 ± 0.3. However, due to 
the averaging of the observations that we employed here, it is not fair to compare directly 
against those results. 

6.4 Transcription factor regulation in the cell cycle of Yeast 

We now consider an application of the multiple output convolutional model in transcrip- 
tional regulation. Microarray studies have made the simultaneous measurement of mRNA 
from thousands of genes practical. Transcription is governed by the presence of absence of 
transcription factor proteins that act as switches to turn on and off the expression of the 
genes. The active concentration of these transcription factors is typically much more diffi- 

12. Different noise models can be used. However, we employed this one so that we can compare directly to 



the results presented in Bonilla et al. ( 2008 1 
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Figure 5: Mean and standard deviation of the percentage of explained variance for exam score 
prediction results on the ILEA dataset. The experiment was repeated ten times. In the bottom of 
the figure DA", FA, PA stands for DTC, FITC and PITC with A inducing values, respectively, 
ICM stands for intrinsic coregionalization model and IND stands for independent GPs. The ICM 



and the independent GPs results were obtained from Bonilla ct al. (2008) 



cult to measure. Several alternative methods have been proposed to infer these activities 
using gene expression data and information about the network architecture. However, most 
of these methods are based on assuming that there is an instantaneous linear relationship 
between the gene expression and the protein concentration. This simplifying assumption 
allows these methods to be applied on a genome wide scale. However, it is possible to obtain 
a more detailed description of the dynamics of this interaction using more realistic models 
that employ differential equations. One example of this type of modeling was presented in 



Barenco et al. (2006 ). Barenco et al. (2006 ) used an ordinary first order differential equation 
to model the interaction between a single transcription factor and a number of genes in a 
biological network motif known as a single input module. A typical dataset of this type 
consists of N measurements of the mRNA abundance level of D genes. The expression level 
fd(t) of gene d at time t is related with the transcription factor protein u(t) through 



dfd 
dt 



B d + S d u(t)-Jdfd(t), 



where B d is the basal transcription rate of gene d, Sd is the sensitivity of gene d to the 
transcription factor and jd is the decay rate of mRNA. Solution for fd(t) was given in 
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equation (11) with Q = 1 and taking into account the additional parameter Bd, it follows 



fd(t) = — + S d f exp(-j d (t - r))u(r)dr. 
Id Jo 

Given some training data for fd(t), the usual way to estimate the dynamics of u(t) is to 
establish an error function and minimize it with respect to each value of u(t) and the 
parameters of the differential equation. 

An alternative way to deal with these differential equations was proposed by [Lawrence 



et al. (2007); Gao et al. (2008). Instead of finding a point estimate for u(t), these authors 
proposed to put a Gaussian process prior over u(t) and use Bayesian analysis to infer the 
posterior distribution for u(t) using the data for fd(t). This corresponds exactly to the 
multiple output convolved Gaussian process framework that we have described in Section 
3j In this case, the latent functions u q (t) correspond to the transcription factor proteins and 
the outputs fd(t) represent the gene expression data. Due to the computational complexity 



issue that we have already discussed, Lawrence et al. (2007); Gao et al. (2008) only dealt 



with a reduced number of genes. The first order differential equation model is obviously 
an oversimplification of transcriptional regulation. However, it considers more aspects of 
the system than clustering, or factor analysis. The sparse approach allows us to apply this 
richer model on a genome wide scale. Also, because this is done within the framework of 
Gaussian processes, it is always possible to add other, perhaps independent, terms to the 
covariance function to deal with any model mismatch. 

As an example, we tested the PITC approximation for the multiple output Gaussian 



process using the benchmark yeast cell cycle dataset of Spellman et al. (1998). Data is 
preprocessed as described in Sanguinetti et al. (2006) with a final dataset of D = 1975 
genes and Q = 104 transcription factors. The data also contains information about the 
structure of the network, basically a matrix of connectivities between transcription factors 
and genes. This is a matrix of size 1975x104, where each entry is either a or a 1, 
indicating the absence or presence of a link between the gene and the transcription factor 
protein. There are TV = 24 time points for each gene. For the PITC approximation, we 
used K = 15 fixed inducing points, equally spaced in the input range. We optimize the 
approximated marginal likelihood through scaled conjugate gradient using 1000 iterations, 
where each iteration takes about 0.72 minutes. Figure [6(a) | shows the expresion level fd(t) 
for ACE2 and in figure [6(b) | the inferred transcription factor u q {t). Equally, figure 6(c) 



shows the expresion level fd(t) for SWI5 and in figure [6(dJ| the inferred transcription factor 
Uq(t). The resulting shape of the transcription factors can be seen as offset versions of the 
shape of the gene expression data, which is a feature in this kind of networks. 

We can also use the sensitivity parameters Sjq for ranking the relative influence of 
a particular transcription factor q, over a particular gene d. In more detail, we use the 
signal-to-noise ratio (SNR) Sd, q /&s dq i defined as the point estimate Sd jQ of the sensitivity 
parameter Sd,q, obtained after the optimization procedure, over the standard deviation for 
that sentitiviy as d ■ An ad-hoc method to estimate this standard deviation consists of 
approximating the mode of the posterior density for the parameters S^ q with a second 
order Taylor expansion: this is known as Laplace's approximation. For details, the reader 



is referred to appendix A. 2 Figure 7(a) shows a histogram of the values of the signal- 



to-noise ratio of the sensitivities of all genes in the dataset with respect to ACE2, this is, 
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all the genes that, according to the connectivity matrix, are regulated by ACE2. Some of 
the highest SNR values are obtained for genes CTS1, SCW11, DSE1 and DSE2, while, for 
example, NCE4 appears to be repressed with a low SNR value. Similar results have been 



reported in other studies (Spellman et al. 1998 Sanguinetti et al. 20061. 
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Figure 6: Gene expresion profile and inferred protein concentration for ACE2 and SWI5. The first 
column shows the gene expression level. The second column shows the mean posterior over the 
transcription factor u q (t) and two standard deviations for the uncertainty. 



Figure [7(b) | shows the signal-to-noise ratio for the sensitivities of all the genes associated 
to the transcription factor SWI5. Among others, genes AMN1 and PLC2 appear to be 
activated by SWI5, as it has been confirmed experimentally by Colman-Lerner et al. (2001 ). 



7. Conclusions 



We have presented several sparse approximations for multiple output GPs, in the context 
of convolution processes. Using these approximations we can capture the correlated in- 
formation among outputs while reducing the amount of computational load for prediction 
and optimization purposes. The computational complexity for the DTC and the FITC 
approximations is 0(NDK 2 ). The reduction in computational complexity for the PITC 
approximation is from 0(N 3 D 3 ) to 0(N 3 D). This matches the computational complexity 
for modeling with independent GPs. However, as we have seen, the predictive power of in- 
dependent GPs is lower. Also, since PITC makes a better approximation of the likelihood, 
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Figure 7: Histograms for the gene specific activities associated to ACE2 and SWI5. In (a), the SNR 
of gene-sensitivities associated to transcription factor ACE2 and in (b), the SNR of gene-sensitivities 
associated to SWI5. 



the variance of the results is usually lower and approaches closely to the performance of the 
full GP, when compared to DTC and FITC. 

With an appropriate selection of the kernel smoothing function we have an indirect way 
to generate different forms for the covariance function in the multiple output setup. We 
showed examples with Gaussian kernels, for which a suitable standardization of the kernels 
can be made, leading to competitive results in high-dimensional input regression problems, 
as seen in the school exam score prediction problem. The authors are not aware of other work 
in which this convolution process framework has been applied in problems with high input 
dimensions. Likewise, convolution appears naturally when solving differential equations in 
dynamical systems and we showed how the sparse methods can be applied to a large scale 
network inference problem. In general, we do not have access to the connectivity matrix, 
so to use this model in those situations we need to put sparse priors over the sensitivity 
parameters. However, our motivation was to show an example where sparse methods like 
the ones we proposed are needed. We obtained sensible results that agree with previous 
literature. 



Recently, Titsias (20091 highlighted how approximations like FITC or PITC can exhibit 
a tendency to overfit when inducing inputs are optimized. Titsias (20091 proposed a vari- 
ational method with an associated lower bound to overcome to some extent the overfitting 



problem. Following the ideas presented here, we can combine easily the method of Titsias 



(20091 and propose a lower bound for the multiple output case. This is part of the future 
work. 
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Appendix A. Derivatives for the sparse methods 

In this appendix, we present the derivatives needed to apply the gradient methods in the 
optimization routines. We present the first order derivatives of the log-likelihood with 
respect to Kff, K u f and K U)U . These derivatives can be combined with the derivatives 
of Kf f , K U) f and K UjU with respect to 6 and employ these expressions in a gradient-like 
optimization procedure. 

We also present the expressions for the Hessian matrix in Laplace's approximation em- 
ployed to compute the uncertainty of the sensitivity parameters in the yeast cell cycle 
example. 



We follow the notation of Brookes (20051 obtaining similar results to Lawrence (2007). 



This notation allows us to apply the chain rule for matrix derivation in a straight-forward 
manner. Let's define G: = vecG, where vec is the vectorization operator over the matrix 
G. For a function C the equivalence between and J^- is given through = ((Jg) :) • 

A.l First Derivatives of the log- likelihood for the gradient methods 

The obtain the hyperparameters, we maximize the following log-likelihood function, 



C(Z,0) 



1 



■loglD + Kf^K-^K^f 



1 



trace 



D + K f U K U ^K u f ) 'yy 



-i 



(17) 



where we have redefined D as D = [Kf f 
notation. Using the matrix inversion lemma and its equivalent form for determinants, 



Kf )U K u |,K Uj fJ M + X, to keep a simpler 



expression (17) can be written as 



£(Z,0) oc-log|K U)U | - ^log|A| - ^log|D| - - trace 



l 

+ - trace 



D- 1 K f>u A- i K u>f D- 1 yy 



-l 



We can find |^ and |^ applying the chain rule to C obtaining expressions for gg£ - , Q ^ 

and q^J~ and combining those with the relevant derivatives of the covariances wrt 6 and 
Z, 



dC dC A 8A:dB: dC u dB: 



dG 



dA: 9D: dG: dD: dG: 



8Ca dA: 8Cg 
9A7(9G: + 5GT 



5gk, 



(18) 



where the subindex in Ce stands for those terms of C which depend on E, G is either Kf f , 
K u .f or K UjU and 5gk is zero if G is equal to Kf t f and one in other case. Next we present 
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expressions for each partial derivative 

dC A 
OA: 
<9D 



1 T dA: 
2 (C ° ' dD: 



K u fD 1 K u fD -1 ' 



<9D: 



((D X HD - 1 ) 



<9D: 



diag(M:), ^g^- = - diag(M:) [(I Kf u K~^) + (Kf jU K~„ ® I) T D ] , 
* - = diag(M:) (Kf )U K~„ K^K^) , — = (K^D" 1 I) + (I K^D" 1 ) T A 



dK u , u 
OA 



5K U „: 



^K uf 



A- 1 K uf D 1 yy T D- 1 



9Ck ^ =-((K-M:) T 
dK u , u : 2 {{ u ' u> 1 ' 



where C = A -1 + A -1 K Uj fD -1 yy T D -1 Kf )U A -:l , H = D — yy T +Kf U A X K U fD x yy T + 
(Kf ;U A~ 1 K U; fD~ 1 yy T ) and To and Ta are vectorized transpose matrices (see, e.g., 



Brookes 2005[ ) and we have not included their dimensions to keep the notation clearer. We 

e 

1 



can replace the above expressions in ( 18 ) to find the corresponding derivatives, so 
dC 



dK {{ : 



((C):) 1 (K u f D K u f D 



((D X HD 1 



((D-^D- 1 ^ 



1 



diag(M:) = ~ (diag(M:) (D _1 JD _1 



2 ((D^JD- 1 M) :) =--(Q:) 



diag(M:) (19a) 

f (19b) 
(19c) 



or simply 



dC 
dK ff 



where J = H — Kf u CK u f and Q = (D X JD 1 0M). We have used the property 

(B:) T (F<g>P) = ((P T BF) :) T in ( ggg and the property diag(B:)F: = (B F):, to go 
from (fl9b|) to ( |19c[ ). We also have 

1 



dC 
0K ur . 



2 (Q:) T [(I Kf, u K-y + (Kf^K-i I) T D ] - - (C:) T 
[(K^fD^ 1 I) + (I K uf D x ) T A ] + ((A- 1 K u , f D- 1 yy T D- 1 
K u,u K u,fQ - CK u f D 1 + A L K U f D Vy T D 1 



(20) 



or simply 



dC 



K-lK u>f Q - CK U fD 1 + A- 1 K Uif D- 1 yy T D 



-i 



where in @, (Q:) T (F I) T D = (Q:) T T D (I F) = (tJ, Q:) T (I F) = (Q:) T (I F). 
A similar analysis is formulated for the term involving Ta- Finally, results for and 



|^ are obtained as 
dC 



5K UjU 



■i (K'i - C - K~„K U) fQKf jU K~„) , || = 
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A. 2 Laplace approximation for the sensitivities 

As an ad-hoc procedure to compute the uncertainty in the sensitivy parameters, we employ 
a Laplace approximation (see e.g. Chapter 4 Bishop, 2006 1 . In particular, consider Q = 1 
and denote by s £ $t. D the vector of sensitivities with entries given by Sj. The Laplace 
aproximation q(s) for the random vector s follows 

q(s) = AA(s|s ,r S0 ), 

where so corresponds to a mode of the log-marginal likelihood for the sparse approximation 
and r^ 1 = — VV£(Z, s, 77)| s=so with rj representing the set of parameters belonging to the 
vector 6 without including s. Asumming that after the optimization procedure we find a 
proper value for so, we need to compute r~ . 

For simplicity, let us denote by Qf f the approximated covariance in the marginal like- 
lihood, this is, Qf ; f = D + Kf )U K~ „K U) f . The log-marginal likelihood is the given as 



£(Z,0)oc-ilog|Q f>f |-i L . 



y 1 Q f f V 



The derivative dC & 8 ^ is equal to 



dC(Z,0) dC dQ t ,f. 



dS d 



dQ 



f,f- 



dS d 



where 



dC 
dQtr. 



-i 



Qf f Vy 1 Q f f 1 



-i 



D x K f U A -iKn fD" 1 . We 



u,fJ 



where the inverse matrix Qf ji is computed using Qf } = D 
assume the sensitivities are independent random variables, so we only need to compute the 
elements in the diagonal of r so . Thus 



<9 2 £(Z,6>) d 



dS d 



dC dQ t r. 



dQ 



f,f : 



dS d 



dC d 2 Q f , f : d 



dQ 
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Finally, in the above expression, we need to compute 
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dS~ d 
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dQ 
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Q ff 1 yy T Q ff 1 
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dQf, f : 



dQf, f : 



dQ 
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dS d 



T 



Q f \ ® Qj} - Q f - f Vy T Q f f 1 ® Q f f 1 - Q f f 1 Q f f Vy T Q f } 



dQ 



f,f : 



dS d 



We do not need to compute the Kronecker products above. Instead, we use the property 
(PBF): = (F T ® P)B:, leading to 



d 
dS~ d 



dC 



dQ 



f,f- 



dS d 
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